@surajssd (Member) commented on Oct 15, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • Update the ExecStart path from /usr/local/bin to /usr/bin to match the actual installation location of the nvidia-device-plugin binary. This ensures the service can properly start the plugin with MIG strategy configuration.

  • e2e tests:

    • Made GPU validation functions configurable: Updated ValidateNodeAdvertisesGPUResources() and ValidateGPUWorkloadSchedulable() to accept GPU count parameters instead of hardcoded values
    • Updated existing tests: All managed experience GPU tests now explicitly pass an expected GPU count of 1, so the validators check the exact count rather than only that it is greater than 0
    • Added MIG test: New Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG validates MIG functionality on Ubuntu 24.04 with Standard_NC24ads_A100_v4 VMs and the MIG2g instance profile
    • Added MIG-specific validators:
      • ValidateMIGModeEnabled() - Verifies MIG mode is enabled via nvidia-smi
      • ValidateMIGInstancesCreated() - Confirms MIG instances are properly created with expected profiles
    • Improved DCGM Validation:
      • Configurable metrics: ValidateNvidiaDCGMExporterScrapeCommonMetric() now accepts a metric parameter instead of hardcoding DCGM_FI_DEV_GPU_UTIL
      • Metric selection: Uses DCGM_FI_DEV_GPU_UTIL for standard GPU tests and DCGM_FI_DEV_GPU_TEMP for MIG-enabled tests
    • Enhanced SSH retry logic:
      • Configurable retry mechanism: New WaitForSSHAfterReboot config option enables SSH retry with exponential backoff for scenarios where nodes reboot during provisioning
      • Reboot detection: Identifies common SSH error patterns indicating system reboots
      • Backward compatibility: Maintains existing single-attempt behavior when retry is not configured
      • Applied to MIG test: 5-minute retry window configured for Ubuntu 24.04 MIG GPU test to handle driver installation reboots

Which issue(s) this PR fixes:

#7063 (comment)

@ganeshkumarashok (Contributor) left a comment:

Can we add an E2E for checking this? That would have failed if not for this change.

@surajssd (Member, Author) replied:

> Can we add an E2E for checking this? That would have failed if not for this change.

PTAL.

@surajssd surajssd force-pushed the suraj/fix-device-plugin-path-for-mig branch from f49513a to eea238b Compare October 16, 2025 22:08
@surajssd surajssd force-pushed the suraj/fix-device-plugin-path-for-mig branch from eea238b to 3429b4f Compare October 17, 2025 00:02
@surajssd surajssd force-pushed the suraj/fix-device-plugin-path-for-mig branch from 3429b4f to 81ac586 Compare October 17, 2025 03:54
@surajssd surajssd force-pushed the suraj/fix-device-plugin-path-for-mig branch from 81ac586 to dff3b84 Compare October 17, 2025 16:41
@surajssd surajssd force-pushed the suraj/fix-device-plugin-path-for-mig branch from dff3b84 to 4fea6e1 Compare October 17, 2025 23:47
Update the ExecStart path from `/usr/local/bin` to `/usr/bin` to match the
actual installation location of the nvidia-device-plugin binary. This ensures
the service can properly start the plugin with MIG strategy configuration.

Signed-off-by: Suraj Deshmukh <[email protected]>
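
The path fix above could be guarded by a small check that reads the `ExecStart=` line of the unit and compares the binary path. The following is only a sketch under assumptions: the helper name `execStartBinary`, the inline unit text, and the `--mig-strategy=single` argument are illustrative and are not taken from this PR's diff.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// execStartBinary returns the binary path from the first ExecStart= line
// of a systemd unit. Hypothetical helper, not part of the PR.
func execStartBinary(unit string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(unit))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		rest, ok := strings.CutPrefix(line, "ExecStart=")
		if !ok {
			continue
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return "", false
		}
		// systemd permits special prefixes such as "-", "@", "+", "!" or ":"
		// in front of the executable path; strip them before comparing.
		return strings.TrimLeft(fields[0], "-@+!:"), true
	}
	return "", false
}

func main() {
	// Illustrative unit content; the real unit file is not shown in this PR text.
	unit := "[Service]\nExecStart=/usr/bin/nvidia-device-plugin --mig-strategy=single\n"
	bin, ok := execStartBinary(unit)
	fmt.Println(bin, ok)
}
```

A check like this would flag a unit that still points at `/usr/local/bin` before the service fails to start on a node.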
- Add gpuCountExpected parameter to ValidateNodeAdvertisesGPUResources() to
  validate exact GPU count instead of just checking > 0
- Add gpuCount parameter to ValidateGPUWorkloadSchedulable() to make GPU
  resource request configurable
- Update all test callers to pass expected GPU count of 1
- Improve logging to show actual vs expected GPU counts for better debugging

Signed-off-by: Suraj Deshmukh <[email protected]>
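
The exact-count comparison described in this commit can be sketched as follows. This is not the PR's `ValidateNodeAdvertisesGPUResources()`: the real validator presumably queries the node through the Kubernetes API, while this sketch assumes a plain map of allocatable resources and a hypothetical helper name, `checkAdvertisedGPUs`.

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuResourceName is the extended resource advertised by the NVIDIA device plugin.
const gpuResourceName = "nvidia.com/gpu"

// checkAdvertisedGPUs compares the advertised GPU count with the expected one
// and reports both values on mismatch. Hypothetical helper for illustration.
func checkAdvertisedGPUs(allocatable map[string]string, gpuCountExpected int) error {
	raw, ok := allocatable[gpuResourceName]
	if !ok {
		return fmt.Errorf("node does not advertise %s", gpuResourceName)
	}
	got, err := strconv.Atoi(raw)
	if err != nil {
		return fmt.Errorf("parsing %s quantity %q: %w", gpuResourceName, raw, err)
	}
	if got != gpuCountExpected {
		return fmt.Errorf("node advertises %d GPUs, expected %d", got, gpuCountExpected)
	}
	return nil
}

func main() {
	// Assumed node state: one GPU advertised.
	node := map[string]string{gpuResourceName: "1", "cpu": "24"}
	fmt.Println(checkAdvertisedGPUs(node, 1)) // matches
	fmt.Println(checkAdvertisedGPUs(node, 2)) // mismatch is reported with both counts
}
```

Passing the expected count as a parameter lets the same validator serve single-GPU tests and MIG tests, where the number of advertised devices differs.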
- Add Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG to validate MIG
  functionality
- Configure test with Standard_NC24ads_A100_v4 VM size and MIG2g instance
  profile
- Add ValidateMIGModeEnabled validator to check MIG mode is enabled via
  nvidia-smi
- Add ValidateMIGInstancesCreated validator to verify MIG instances are properly
  created
- Test validates device plugin, DCGM exporter, and GPU resource scheduling with
  MIG

Signed-off-by: Suraj Deshmukh <[email protected]>
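
The two MIG validators could be built on parsing of `nvidia-smi` output along these lines. This is a sketch under assumptions: the PR text does not show which `nvidia-smi` invocations the validators use, so the commands named in the comments, the helper names, and the sample output (with elided UUIDs) are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// migModeEnabled reports whether every GPU reports MIG mode "Enabled".
// Assumed input: output of
// `nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader`.
func migModeEnabled(out string) bool {
	seen := false
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if line != "Enabled" {
			return false
		}
		seen = true
	}
	return seen
}

// countMIGDevices counts MIG devices in `nvidia-smi -L` style output whose
// profile name starts with profilePrefix (for example "2g.").
func countMIGDevices(listing, profilePrefix string) int {
	n := 0
	for _, line := range strings.Split(listing, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "MIG" && strings.HasPrefix(fields[1], profilePrefix) {
			n++
		}
	}
	return n
}

func main() {
	// Illustrative output, not captured from a real node.
	listing := "GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-...)\n" +
		"  MIG 2g.20gb     Device  0: (UUID: MIG-...)\n" +
		"  MIG 2g.20gb     Device  1: (UUID: MIG-...)\n" +
		"  MIG 2g.20gb     Device  2: (UUID: MIG-...)\n"
	fmt.Println(migModeEnabled("Enabled\n"), countMIGDevices(listing, "2g."))
}
```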
Modified ValidateNvidiaDCGMExporterScrapeCommonMetric to accept a metric
parameter instead of hardcoding DCGM_FI_DEV_GPU_UTIL. Updated test calls to
specify the appropriate metric for each scenario:
- Use DCGM_FI_DEV_GPU_UTIL for standard GPU tests (Ubuntu 24.04, Ubuntu 22.04,
  AzureLinux3)
- Use DCGM_FI_DEV_GPU_TEMP for MIG-enabled tests (Ubuntu 24.04 MIG)

This allows for more flexible validation based on the specific GPU configuration
being tested.

Signed-off-by: Suraj Deshmukh <[email protected]>
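
Making the metric a parameter amounts to searching the exporter's scrape output for a caller-supplied name. A minimal sketch, assuming the scrape is in the Prometheus text exposition format; `hasMetric` and the sample scrape are hypothetical and are not the PR's `ValidateNvidiaDCGMExporterScrapeCommonMetric`.

```go
package main

import (
	"fmt"
	"strings"
)

// hasMetric reports whether a Prometheus text-format scrape contains at least
// one sample for the given metric name. Comment lines (# HELP, # TYPE) are skipped.
func hasMetric(scrape, metric string) bool {
	for _, line := range strings.Split(scrape, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		rest, ok := strings.CutPrefix(line, metric)
		if !ok {
			continue
		}
		// The name must be followed by labels or a value, so that a longer
		// metric name sharing the same prefix does not match.
		if strings.HasPrefix(rest, "{") || strings.HasPrefix(rest, " ") {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative scrape, not captured from a real exporter.
	scrape := "# TYPE DCGM_FI_DEV_GPU_TEMP gauge\n" +
		"DCGM_FI_DEV_GPU_TEMP{gpu=\"0\"} 35\n"
	fmt.Println(hasMetric(scrape, "DCGM_FI_DEV_GPU_TEMP"))
	fmt.Println(hasMetric(scrape, "DCGM_FI_DEV_GPU_UTIL"))
}
```

With the name passed in, the standard GPU tests can look for `DCGM_FI_DEV_GPU_UTIL` while the MIG test looks for `DCGM_FI_DEV_GPU_TEMP`, as this commit describes.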
Add configurable SSH connectivity retry mechanism for scenarios where nodes may
reboot during provisioning (e.g., MIG-enabled GPU nodes).

- Add WaitForSSHAfterReboot config option to enable retry behavior
- Implement retry logic with exponential backoff using
  wait.PollUntilContextTimeout
- Add reboot detection for common SSH error patterns
- Enable 5-minute retry window for Ubuntu 24.04 MIG GPU test
- Maintain backward compatibility with existing single-attempt behavior

This resolves flaky test failures when GPU driver installation triggers
automatic reboots during node provisioning.

Signed-off-by: Suraj Deshmukh <[email protected]>
Signed-off-by: Suraj Deshmukh <[email protected]>
@surajssd surajssd force-pushed the suraj/fix-device-plugin-path-for-mig branch from 4fea6e1 to 52b7831 Compare October 20, 2025 16:27
@ganeshkumarashok ganeshkumarashok merged commit fd46cb8 into master Oct 20, 2025
35 of 38 checks passed
@ganeshkumarashok ganeshkumarashok deleted the suraj/fix-device-plugin-path-for-mig branch October 20, 2025 19:33